Abstract

The example application is a distributed visualization
involving a supercomputer and a graphics workstation.
The visualization computation is performed on a
Connection Machine, and the results are rendered using
a Silicon Graphics workstation. The UltraNet network
installed at NAS allows high-bandwidth communication
between the computers. Ideally, taking advantage of the
UltraNet is no more complex than developing TCP/IP and
Unix BSD socket-type applications on a single machine.
In practice, there are several problems in developing
an application using the UltraNet. This paper identifies
potential problems and discusses techniques for
overcoming them. Performance of UltraNet communication
is measured and found to be 10 MB/sec for SGI VGX
workstations.

1 Introduction 

This paper examines the issues in creating an Ul- 
traNet TCP/IP socket-based distributed application. 
In order to preserve as much generality as possible, the 
discussion is focused on the UltraNet-related aspects 
of the development process. This paper provides: 

• An explanation of the UltraNet. 

• A summary of the socket model used in UltraNet 
communications. 

• The steps in the development of a distributed ap- 
plication. 

• A performance survey of the UltraNet. 

The potential for effective distributed applications 
at the Numerical Aerodynamic Simulation (NAS) fa- 
cility of NASA Ames Research Center grew with the 

*Work supported by NASA Contract NAS 2-12961


installation of the UltraNet. The UltraNet is a hub- 
based network. The NAS installation currently con- 
sists of nine hubs connected with fiber optic cables, 
forming a ring which supports 1 Gigabit/sec transfer 
rates between the hubs. The hubs connect to over 80 
workstations, supercomputers and other nodes with 
various types of interfaces and cable connections. 

The UltraNet software allows existing socket-based 
communications applications to be relinked and run on 
UltraNet with a minimum of programming effort. Af- 
ter a socket-based program is relinked, however, there 
are still several ways to enhance performance, reliabil- 
ity and programmability. 

Using a distributed visualization application as an 
example, this paper examines application design and 
implementation on the workstation and supercom- 
puter sides, division of work between the machines 
and other issues. Specific details of the rendering and 
graphics algorithms in the example application are not 
emphasized except where they are relevant to the use 
of the UltraNet. 

Distributed graphics applications involving a su- 
percomputer, workstation and high-speed network are 
rapidly becoming an important tool for visualization 
[3, 4, 5, 6]. Computation is done on the supercom- 
puter, and visualization can be done over the network, 
with the displays generated on a workstation. This 
type of distributed graphics application matches the 
right tools for the right tasks. Supercomputers have 
the memory and speed to solve the simulation and 
generate the data to be visualized. Graphics work- 
stations have better tools for developing visualization 
software and their cost and size allow a researcher to 
have one on her desk. 

UltraNet performance varies widely depending on
the hardware and software using the network. The
effective transfer rate of data over the UltraNet
depends primarily on the slowest node involved. With a
VME-based interface on the SGI workstations, one may
expect five to 10 megabytes/sec (MB/sec). Timings to
support this expectation are presented in the section
on performance.

SERVER      CLIENT

socket
bind
listen
accept      socket

Table 1: Server/Client Model: Steps in establishing a
socket connection.

2 Basic UltraNet Behavior 

The UltraNet and the device drivers that support it 
rely on a UNIX device model to send and receive data. 
UltraNet connections behave like raw devices. Data is 
moved directly from or to the application memory by 
Direct Memory Access hardware (DMA) or by inter- 
rupt level processing. Hardware ‘packet size’ is trans- 
parent to users. The UltraNet can write whatever 
size buffer you wish to send, up to system-dependent 
limits. Currently, the largest SGI/VME write that 
succeeds is about 3.6 megabytes. 

The model of file I/O is a good approximation of 
UltraNet behavior. In UNIX, the system calls read, 
write, open and close are provided to allow users 
to handle files. Although a file system may have some 
atomic size, users generally do not care what that size 
is. Sockets use the same system calls that file I/O
uses: read, write, etc. More specifically, UltraNet
sockets are stream facilities which provide full-duplex
communication paths between user processes and the
kernel’s interactions with the network hardware.

The speed of delivery of a read or write depends 
on the speed of the machines involved. The ‘weakest 
link in the chain,’ in terms of I/O throughput, will 
determine the overall speed of the transfer. Data is 
read directly to and from memory. A VME interface
(e.g., SGI VGX, Convex) to the UltraNet is slower
than a HiPPI interface (e.g., CM-2).

3 The Socket Model 

The sockets in the example application are handled 
in a client/server model. One process listens for re- 
quests for connections and makes the connection when 
necessary (‘server’). The other process (‘client’) asks 


for a connection from some server. When the client 
gets the connection, both server and client can read 
and write to the socket. The client must know the 
machine address and port number of the server to be 
able to connect. The socket library calls which estab- 
lish a server and client connection are shown in Table 
1. Note that the server must have completed the first 
three steps before the client does a socket call or the 
connection will fail. 
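
The call sequence in Table 1 can be sketched with ordinary BSD-style sockets. The example below is a minimal illustration in Python, assuming standard TCP over the loopback interface rather than the UltraNet socket library. The function names (`start_server`, `serve_once`, `run_client`, `demo`) and the OS-assigned port are illustrative choices and are not taken from the paper's appendices.

```python
import socket
import threading


def start_server(host="127.0.0.1"):
    """Server steps 1-3: socket, bind, listen."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))   # port 0 asks the OS for any free port (assumption)
    srv.listen(1)         # a backlog of 1, the only value the text says is supported
    return srv


def serve_once(srv, received):
    """Server step 4: accept one connection and read until the client closes."""
    conn, _addr = srv.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    received.append(b"".join(chunks))


def run_client(host, port, payload):
    """Client side: socket, then connect to the known address and port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((host, port))
        cli.sendall(payload)


def demo(payload=b"particle data"):
    # The server finishes socket/bind/listen before the client starts.
    srv = start_server()
    host, port = srv.getsockname()
    received = []
    worker = threading.Thread(target=serve_once, args=(srv, received))
    worker.start()
    run_client(host, port, payload)
    worker.join()
    srv.close()
    return received[0]


if __name__ == "__main__":
    print(demo())
```

In this sketch the server is listening before the client is created, which mirrors the ordering requirement described above.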

The UltraNet socket compatibility library supports 
most of the UNIX socket calls. Some minor variations 
listed below are not currently supported: 

• sendmsg, recvmsg

• shutdown 

• listen(s, n) where n > 1 

• send and receive are not interruptible 

TCP/IP style connections should be used in most 
applications, since they are a reliable connection- 
oriented protocol, as opposed to connectionless UDP 
datagram protocols. UDP does not provide error con- 
trol, flow control or sequencing. Several UNIX books 
give examples of UDP style sockets, which are not gen- 
erally useful for applications sending large amounts of 
data. For more detail on sockets and TCP/IP, see [8]. 

When using an SGI with UltraNet, it is necessary 
to link with the ulsock library. The ulsock library 
provides socket calls that work with UltraNet. A few 
important differences arise when using the ulsock li- 
brary. The ulsock library replaces the dup2 function 
from C with its own dup2. A potential conflict ex- 
ists with the mpc parallel programming library, which 
also replaces dup2. For UltraNet linked programs
which use both ulsock and mpc, place -lulsock
before -lmpc on the compile command line (i.e., ensure
that your program uses the UltraNet dup2).

To use the UltraNet, it is necessary to attach to the 
host address of the UltraNet interface. Machines with 
UltraNet interfaces have internet addresses dedicated 
to an UltraNet native path, an UltraNet internet pro- 
tocol path and a general internet protocol path. Table 
2 shows typical NAS names of addresses on a given 
network. It is easy to switch between the networks 
by changing addresses of server connections. The per- 
formance discussion of this paper deals only with the 
native UltraNet addresses (i.e., hostnames in the form 
host-u). 
HOSTNAME    NETWORK

host        Ethernet address
host-uip    UltraNet internet protocol address
host-u      Native UltraNet address

Table 2: Example NAS Hostnames and their Networks
for UltraNet Hosts. Native UltraNet is generally the
fastest; Ethernet is slowest.


UltraNet Name      CMFS Name

AF_INET            CMFS_AF_INET
socket             CMFS_socket
read               CMFS_read_file_always
struct sockaddr    struct cm_sockaddr
perror             CMFS_perror

Table 3: CMFS Socket Library Name differences


4 The Distributed Application: 
A Case Study 

The production of a working UltraNet application
is not extremely complex; however, network reliability
and access to network resources limit how development
takes place. The application examined here involves
the Connection Machine (CM) and a Silicon Graphics
VGX (SGI). The Connection Machine side requires use
of the Connection Machine High Performance Parallel
Interface (CM-HiPPI) processor, the CM DataVault, the
CM front-end and at least one Sequencer on the CM
itself. The SGI side of the application involves an
SGI VGX running two processes accessing shared memory
buffers governed by semaphores. Both sides depend on
the UltraNet.

The sum of the parts forms a working UltraNet
application, but it is easier to debug and develop the
individual parts of the application separately. The
amount of hardware involved in the full application
limits developing and debugging. If the network or the
DataVault or the CM is not up, the full application
cannot run. Separate development of parts allows work
to proceed even when not all of the hardware elements
are available. As an example, the
SGI side can operate with data coming from another 
process (on the same machine or from another work- 
station) over the Ethernet. The SGI side can also op- 
erate without the graphics process. For testing con- 
nections from the CM, a simpler socket program is 
used which can isolate network problems and bench- 
mark network response. 

Some other development obstacles include: availability
of CM time, access to the SGI VGX graphics console, and
CM-HiPPI and UltraNet hardware problems. A
list of the major steps in the development process
shows how the application progressed.

1. Developed client/server procedures between 
SGIs. 

2. Ported client to CM. 

3. Developed single process SGI application. 


4. Wrote multiple process SGI producer/consumer. 

5. Fused producer/consumer with sockets on SGI. 

6. Wrote serial process to simulate CM for debug- 
ging. 

7. Incorporated CM side. 

8. Integrated and tested full application. 

All communication is based on procedures written
in Step 1. The code for these routines is given in
Appendix A. The two major procedures are server()
and client(). The client routine takes a hostname
as an argument and attempts to connect over a
previously agreed upon port number to the server. The
port number can be defined in an include file visible
to both server and client processes.

Once the basic communication routines are de- 
bugged, it is possible to test UltraNet throughput. 
These tests are discussed below. It is also relatively 
easy to port these communication routines to the 
Connection Machine, since Thinking Machines pro- 
vides analogous library calls in their CM File System 
(CMFS) library [9]. The major difference is in the 
naming. Some examples are shown in Table 3. The
cmclient() source code is included in Appendix B.
Both the client and server codes in the appendices are
based on examples in [8].

4.1 Partitioning the Application 

Partitioning the work of a distributed application 
is an important design decision which affects the per- 
formance and utility of the application. The nature 
of the application, along with the relative speeds of 
the computers and networks involved determine how 
partitioning of work should be done. Several possibil- 
ities in dividing the work for interactive visualization 
applications have been explored [3, 5]. 

One technique is to do both simulation and render- 
ing computations on one or more supercomputers and 
display the results on a workstation. This approach al- 
locates heavy computation to the supercomputer and 








Figure 1: Basic architecture of distributed application.
The CM communicates data via the CM-HiPPI
to the UltraNet Hub. The SGI has a process devoted
to reading buffers from the Hub into shared memory.
Another SGI process handles rendering and user
interface.

relegates the workstation to displaying precomputed 
bitmap images. Sending a full screen color bitmap to 
the workstation may require an extremely fast network
to support animation. This model of partitioning
is analogous to the model the X Window System
uses for distributing graphics. This type of partitioning
is useful for computationally expensive rendering
problems such as volume visualization, where super- 
computers can process the image much faster than a 
workstation can. 

The partitioning for the example application was 
designed to allow the workstation to handle rendering 
and user interaction. The amount of data sent to the 
workstation over the network is small. The Connec- 
tion Machine calculates new data positions, then sends 
3-D coordinates of the positions to the workstation for 
rendering. The workstation handles interaction with 
the user, rotation, scaling and lighting of the visual- 
ization. Figure 1 illustrates the basic architecture of 
the application. 

It is important to balance the amount of network 
traffic and the computation requirements on the work- 
station and CM to get acceptable throughput. The 
CM-HiPPI is also more efficient with buffer sizes
greater than 256KB (or eight 32-bit floating point
values on each processor of one sequencer). One drawback
of this behavior is that at transfer sizes where the 
CM-HiPPI and UltraNet are most efficient, the overall 
transfer time can be rather long. If one communicates 
in buffers of 2MB or more from the CM to an SGI, 
each transfer takes around 0.5 secs to complete. This 
is too slow for many graphics applications, so soft- 
ware buffering may be considered to amortize the cost 
of data transfers. 

For the case where a small amount of data needs 
to be sent relatively often, similar problems arise. As 
shown in the performance section below, sending less 
than 256KB of data is non-optimal because of startup 
cost in buffer communication, so packaging data to 
be sent into larger buffers may give better UltraNet 
throughput. The delay time in waiting for enough 
data to send may not be acceptable for interactive 
applications. Another alternative is to use Ether- 
net for small buffers. These tactics are application- 
dependent, because the amount of buffering that will 
help depends on the ratio of computation speed to net- 
work speed and to workstation rendering speed. The 
performance section of this paper gives timings which 
may help developers predict what type of buffering is 
most appropriate for specific applications. 

4.2 Workstation Application Architec- 
ture 

Since the example application requires the SGI to 
render 3-D graphics while also reading UltraNet data 
into memory, a multiple process architecture was used. 
A producer/consumer scheme is used for efficiency.
The producer is the process which reads data from the 
UltraNet. The producer tries to have data ready for 
the consumer at all times. The consumer ‘consumes’ 
this data when it needs to update the screen. Reading 
data from the UltraNet continuously with one proces- 
sor, while rendering previously read buffers allows the 
application to be more interactive. This tactic is one of 
the suggestions in the UltraNet Network Applications 
Development guide [10], and is especially productive 
on multiple processor machines such as the VGX. 
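
The producer/consumer arrangement described above can be sketched as a bounded pool of buffers guarded by two counting semaphores. This is only an illustration under stated assumptions: Python threads stand in for the two SGI processes, an in-memory list stands in for UltraNet reads, and the names `BufferPool`, `producer`, `consumer` and `run` are hypothetical rather than taken from the application.

```python
import threading
from collections import deque


class BufferPool:
    """Bounded queue of filled buffers guarded by two counting semaphores."""

    def __init__(self, slots):
        self.free = threading.Semaphore(slots)  # empty slots the producer may fill
        self.filled = threading.Semaphore(0)    # filled slots the consumer may take
        self.lock = threading.Lock()            # protects the queue itself
        self.queue = deque()

    def put(self, buf):
        self.free.acquire()       # blocks when the consumer has fallen behind
        with self.lock:
            self.queue.append(buf)
        self.filled.release()

    def get(self):
        self.filled.acquire()     # blocks until the producer has data ready
        with self.lock:
            buf = self.queue.popleft()
        self.free.release()
        return buf


def producer(pool, source):
    """Stand-in for the process that reads buffers from the network."""
    for buf in source:
        pool.put(buf)
    pool.put(None)                # sentinel: no more data


def consumer(pool, rendered):
    """Stand-in for the rendering process; records the size of each buffer."""
    while True:
        buf = pool.get()
        if buf is None:
            break
        rendered.append(len(buf))


def run(n_buffers=5, size=1024, slots=2):
    pool = BufferPool(slots)
    source = [bytes([i % 256]) * size for i in range(n_buffers)]
    rendered = []
    threads = [
        threading.Thread(target=producer, args=(pool, source)),
        threading.Thread(target=consumer, args=(pool, rendered)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rendered
```

With `slots` greater than one, the producer can keep filling buffers while the consumer is busy, which is the property the text relies on for interactivity.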

The producer/consumer code and the networking
and rendering code are independent units which were
integrated into one unit after they were debugged. To
simplify development, the producer/consumer prototype
was based on an SGI documentation example [7],
which gave insight into how the producer/consumer
shared memory code would work. A second prototype
was a single process which did two tasks: read the
network data and render data. Later, this prototype was
integrated into the two-process producer/consumer
code without major problems. Figure 1 shows the
SGI section of the application as two processes sharing
memory. The whole SGI section in Figure 1 is composed
of the integrated prototypes described above.

The serial rendering and networking code and the 
non-networked two process producer/consumer were 
combined with relative ease, because they had been 
independently debugged and their behavior was well 
defined. The basic software engineering premise of 
prototyping small modules to be incorporated into the 
larger architecture after they have been tested guided 
the application’s development. It was necessary to 
proceed according to this premise because many of 
the hardware and software components were new and 
somewhat unreliable. In other words, guessing at how 
to integrate a large number of unfamiliar concepts 
would not have worked. 

When using shared memory and two processes with
UltraNet sockets, it is important that the processes’
critical sections are handled correctly. If the control
threads are not handled correctly, the reads and writes
to the UltraNet will not match up and a form of
deadlock will result. If UltraNet reads and writes are
not paired, both the processes and the port they use
will be severely hung¹ because there is no time-out
mechanism. Modifying and testing the SGI example before
integrating network communication code allowed
validation of the multiple process code without the
problems of process and network deadlock.
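
One common way to keep reads and writes paired is to prefix every buffer with its length, so the reader always knows how many bytes to expect. The sketch below illustrates that idea on standard sockets. The 4-byte length header, the helper names and the two-second time-out are assumptions made for illustration; the time-out in particular is a feature of ordinary sockets and, according to the text, was not available on the UltraNet sockets described here.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian length prefix (illustrative convention)


def send_buffer(sock, payload):
    """Write one length-prefixed buffer."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv may return fewer."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before sending the full buffer")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_buffer(sock):
    """Read one length-prefixed buffer written by send_buffer."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def demo(payload=b"x" * 1000):
    # socketpair gives two connected endpoints in one process, for testing only.
    writer, reader = socket.socketpair()
    writer.settimeout(2.0)
    reader.settimeout(2.0)   # an unmatched read fails instead of hanging forever
    try:
        send_buffer(writer, payload)
        return recv_buffer(reader)
    finally:
        writer.close()
        reader.close()
```

Each `send_buffer` call is matched by exactly one `recv_buffer` call, so the two sides cannot drift out of step as long as both follow the same convention.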

4.3 Supercomputer Application Architec- 
ture 

The construction of the application for the
Connection Machine was analogous to the process on the
workstation. Figure 1 shows the CM-2 and the HiPPI
as independent blocks through which data flows.
Although these are separate hardware units, they also
provide a model for the partitioning of the CM section
of the example application. Two modules were
developed: a standalone particle tracing code (CM-2) and
a CMFS socket-based communication code (HiPPI).
The particle code provides the data which the HiPPI
communicates to the UltraNet. The development of
the particle tracing code is not discussed in this
paper. The socket code, however, illustrates how the
UltraNet interface may be different from machine to
machine.

¹This time-out problem was alleviated in the CMFS sockets
package, which greatly reduced the number of dead processes
and hung TCP ports during development.


The primary concerns when reading or writing 
network data on the Connection Machine are paral- 
lel/serial data format and byte-ordering. The Connec- 
tion Machine stores a floating point number in paral- 
lel format which must be transposed into serial for- 
mat before sending. Likewise, incoming data must be 
transposed from serial to parallel format before it is 
useful to the CM. Furthermore, the Connection Ma- 
chine has a little endian byte order, while the SGI 
VGX has big endian byte order, which means bytes 
must be swapped for data to be the same on the CM 
and SGI. Byte swapping can be done quickly for large 
amounts of data in parallel on the CM. The byte- 
swapping transformation is given by: 

ABCD → DCBA

If the byte swapping is done on the CM, it must be 
done before the data is transposed to serial format. 
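
The byte-swapping transformation can be illustrated on a serial buffer of 32-bit words. This sketch runs on a single host in Python and only demonstrates the ABCD to DCBA reordering; it does not model the parallel swap on the CM, and the function names are hypothetical.

```python
import struct


def swap32(word):
    """Reverse the byte order of one 4-byte word: ABCD -> DCBA."""
    if len(word) != 4:
        raise ValueError("expected exactly 4 bytes")
    return word[::-1]


def swap_buffer(buf):
    """Byte-swap every 32-bit word in a buffer whose length is a multiple of 4."""
    if len(buf) % 4:
        raise ValueError("buffer length must be a multiple of 4")
    return b"".join(buf[i:i + 4][::-1] for i in range(0, len(buf), 4))


def little_to_big_floats(values):
    """Pack floats little-endian, then swap each word to get big-endian bytes."""
    little = struct.pack("<%df" % len(values), *values)
    return swap_buffer(little)
```

Swapping each little-endian word yields the same bytes as packing the values big-endian directly, which is what the receiving big-endian machine expects.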

Currently, the CMFS_transpose_always call, which
does the transpose from serial to parallel or vice-versa,
is somewhat slow and hampers the effective performance
of transfers to and from the CM. Future software
releases should improve the performance of this
transpose.

The CM-HiPPI handles UltraNet data for the CM.
The CM-HiPPI is a Sun 4 which runs the socket server
daemon. The socket server handles all socket connections
to and from the CM, so the socket server must be
running or the CM cannot access the UltraNet. When
the CM-HiPPI was first installed there were several
problems with the socket server and the hardware
interfaces on the CM-HiPPI. The reliability of the
software and hardware improved as bugs were found and
fixed by Thinking Machines. To continue development
when the CM-HiPPI was down, a “CM simulator” was
written which ran on a VGX. This process simply read
pre-computed data from disk and connected to the
SGI process with UltraNet sockets.

5 UltraNet Performance 

There are several factors which affect an application’s
UltraNet throughput. Most influential is the
hardware interface to the UltraNet. The UltraNet
as configured at NAS supports 1 gigabit/sec transfer
rates between its hubs and 250 megabits/sec transfers
(about 32MB/sec) between nodes. As mentioned
above, the SGI VME interfaces and the CM-HiPPI
interface do not provide 100% of this throughput. When
transferring data over the UltraNet, the slowest
interface sending or receiving data determines the
maximum rate for the transfer. In the application discussed
in this paper, the SGI VME interface is the bottleneck
in transfer speed. The VME interface is the hardware
that accesses the SGI’s bus and communicates data to
the UltraNet hub over fiber-optic connections.

Given a fixed set of hardware, the key parameters 
which affect performance are: 

1. Size of reads/writes. 

2. Network traffic. 

3. Data path from host to host. 

Of these, the first (read/write size) is both
user-controllable and very influential in achieving optimal
performance. Since users on given hosts cannot easily
change network traffic or the data path of the transfer,
these factors are not examined.

5.1 SGI VME Interface Performance 

The UltraNet is connected to several dozen SGI’s 
at NAS. The performance of the SGI and VME In- 
terface to the UltraNet will affect many distributed 
applications written at NAS. A series of timings of 
transfer rates was conducted for read and write direc- 
tions from SGI to SGI. Tests were also done on single 
machines in a ‘loopback’ manner. The results of these 
benchmarks as well as a comparison of Ethernet and 
UltraNet transfer rates are presented in the following 
discussion. Timings for reads and timings for writes 
are nearly equivalent so that read and write operations 
can be considered equivalent in performance. These 
timings demonstrate: 

• Reading or writing buffers smaller than 256 kilo- 
bytes (KB) is inefficient, and sizes of 1 to 2 
megabytes consistently give best performance for 
the UltraNet. 

• Ethernet is more efficient than UltraNet for small 
data transfers. 

• The DMA on the SGIs affects performance of 
large transfers. 

• The largest size buffer the SGI VME interfaces 
reliably accept is about 3.6MB. 

The processor and I/O facilities of a 33MHz SGI
VGX can support transfer rates of at most 12MB/sec.
Figure 2 shows that VME to VME transfer rates of
10-12MB/sec can be sustained when buffer sizes are
larger than 256KB. Two machines communicating with
each other over UltraNet can each support this rate
due to the combination of their processor speed and
the VME I/O performance. When two machines are
involved, each can devote its full I/O throughput to
reading or writing.

Figure 2: UltraNet transfers of 30 megabytes of data
from SGI VGX to SGI VGX on the same UltraNet
Hub. Points indicate size of buffer transferred,
starting with 32KB, ending at 2 megabytes.
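
A benchmark of the kind described here, where a fixed total is moved in buffers of varying size, can be sketched as follows. The code is an assumption-laden illustration: it times a local socket pair in Python rather than UltraNet hardware, the names `measure_rate` and `sweep` are hypothetical, and the rates it reports say nothing about the figures in this paper.

```python
import socket
import threading
import time


def _drain(sock, total):
    """Reader side: consume `total` bytes, then return."""
    remaining = total
    while remaining:
        chunk = sock.recv(min(remaining, 1 << 20))
        if not chunk:
            break
        remaining -= len(chunk)


def measure_rate(total_bytes, buffer_size):
    """Return the observed rate in MB/sec (1 MB = 10**6 bytes here) for
    writing total_bytes in pieces of at most buffer_size bytes."""
    writer, reader = socket.socketpair()
    worker = threading.Thread(target=_drain, args=(reader, total_bytes))
    worker.start()
    buf = bytes(buffer_size)
    start = time.perf_counter()
    sent = 0
    while sent < total_bytes:
        n = min(buffer_size, total_bytes - sent)
        writer.sendall(buf[:n])
        sent += n
    worker.join()                       # stop the clock only after all data is read
    elapsed = time.perf_counter() - start
    writer.close()
    reader.close()
    return (total_bytes / 1e6) / elapsed


def sweep(total_bytes=4 * 1024 * 1024,
          sizes=(32 * 1024, 256 * 1024, 1024 * 1024)):
    """Measure the rate for each buffer size and return {size: MB/sec}."""
    return {size: measure_rate(total_bytes, size) for size in sizes}


if __name__ == "__main__":
    for size, rate in sweep().items():
        print(size, round(rate, 1))
```

Holding the total constant while varying the buffer size isolates the per-call overhead, which is the effect the timings in this section are meant to expose.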

UltraNet is not as fast as Ethernet when transfer- 
ring buffers smaller than 10KB. The overhead in us- 
ing UltraNet is greater than Ethernet’s overhead for 
these small buffer sizes. Figure 3 demonstrates that 
Ethernet is over twice as fast as UltraNet for buffer
sizes between 10 bytes and 5KB. Ethernet’s maximum 
speed is reached at 1KB. UltraNet betters Ethernet’s 
maximum speed at sizes above 10KB. If small buffers 
are being transferred over and over, Ethernet provides 
better throughput. It is possible to have both types 
of connections active at the same time. As an exam- 
ple, one might send control information (e.g., mouse 
position, visualization parameters) between two ma- 
chines over Ethernet, while concurrently communicat- 
ing data over UltraNet. 
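
The idea of routing small control messages over Ethernet and bulk data over UltraNet can be sketched as a hostname choice based on the naming convention in Table 2. The 10KB threshold comes from the measurements discussed above; the function name and the example host `vgx` are hypothetical, and a real application would tune the threshold for its own hardware.

```python
SMALL_BUFFER_LIMIT = 10 * 1024  # bytes; below this size Ethernet was measured to be faster


def hostname_for(host, nbytes, limit=SMALL_BUFFER_LIMIT):
    """Pick the Ethernet name for small buffers and the native UltraNet
    name (host-u) for large ones."""
    if nbytes < 0:
        raise ValueError("buffer size cannot be negative")
    return host if nbytes < limit else host + "-u"
```

For example, `hostname_for("vgx", 64)` selects the Ethernet name for a mouse-position update, while `hostname_for("vgx", 1 << 20)` selects the native UltraNet name for a one-megabyte data buffer.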

A ‘loopback’ test on a single machine shows that 
UltraNet uses the VME interface, even when it is read- 
ing and writing to local memory. Figure 4 shows that 
writing from one process to another on the same ma- 
chine limits the write speed to around 6 MB/sec. This 
is due to the single VME interface doing both reads 
and writes. The two machine SGI rates of 12 MB/sec 
can only be achieved when a single VME I/O board 
is devoted to handling a single connection. 

The DMA hardware on the SGI influences the
UltraNet transfer rate. The DMA on the SGI does the
read or write from user memory to the UltraNet
device. The performance of reads and writes may be
limited by the DMA if the DMA’s address space fills
up with data from the read or write.

Figure 3: Transfers of 3 megabytes of data from SGI
VGX to SGI VGX over Ethernet and UltraNet. Ethernet
is more than twice as fast as UltraNet for buffer
sizes smaller than 5KB. At buffer sizes of 10KB
UltraNet is as fast as Ethernet.

Figure 4: 30 megabytes transferred between two
processes on the same workstation. Peak rates are limited
by one VME I/O channel handling reads and writes for
both processes simultaneously. Compared to speeds
for two machine transfers, local transfers are about
half as fast.

The nature of the DMA limitation is hard to 
demonstrate with two machines because the sockets 
will not handle buffers large enough to swamp the 
DMA. Single machine loopback tests doing reads and 
writes using a single DMA at the same time reveal a 
performance limitation. Transfer rates suffer once the 
combination of read and write buffers approaches a 
certain size. Figure 5 shows a large drop-off in
buffers transferred per second once local buffer
transfer sizes exceed 2.5 MB.

Since two machines communicating have twice as 
much DMA memory space as one machine, and one 
machine encounters problems at 2.5MB, problems 
may arise when sending buffers greater than 5MB be- 
tween two machines. It is impossible to confirm this 
hypothesis on two machines because the current Ul- 
traNet drivers for SGI will not complete 5MB trans- 
fers between VME interfaces. Eventually, however, 
the DMA limitation may prove problematic since the 
UltraNet protocol specification does allow buffer sizes 
up to 64MB to be sent. 

These large buffer cases are not too worrisome for 
many application developers, since sending this much 
data (4MB or more) is too slow for interactive use and 
is a large amount of data for a workstation to process. 
For example, if an application needed to send 4MB per 
frame for graphics, the UltraNet and VME interfaces 
would be too slow to support animation frame rates. 
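
The frame-rate argument can be checked with a short calculation. The sketch below uses the 10-12MB/sec SGI/VME rates reported in this section and the 4MB-per-frame example from the text; the function name is hypothetical and the model ignores rendering time and per-transfer startup cost.

```python
def frames_per_second(rate_mb_per_sec, frame_mb):
    """Upper bound on frame rate when every frame must cross the network."""
    if frame_mb <= 0:
        raise ValueError("frame size must be positive")
    return rate_mb_per_sec / frame_mb


# 4MB frames at the measured SGI/VME rates
LOW = frames_per_second(10.0, 4.0)    # 2.5 frames/sec
HIGH = frames_per_second(12.0, 4.0)   # 3.0 frames/sec
```

Even at the best measured rate this gives about three frames per second, well short of what smooth animation usually needs.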

The hardware design of the UltraNet VME board 
also influences throughput, but very few specific de- 
tails are available about the board’s architecture. Ul- 
traNet is a proprietary system, and documents ex- 
plaining the interface board are not generally avail- 
able. 

UltraNet performance between various worksta- 
tions, minisupercomputers and supercomputers was 
measured by an Ultra Network Technologies employee 
before the installation of the full UltraNet configura- 
tion at NAS [2]. This work contains more detailed 
explanation and analysis of performance for Sun, Al- 
liant and Convex VME Interfaces, but does not cover 
SGI performance. 

5.2 CM to SGI Timings 

The Connection Machine has a HiPPI-based interface
and two 32MB/sec I/O busses able to connect to
the UltraNet. Since the UltraNet supports 32MB/sec
sustained, while the CM supports at most 64MB/sec
(2 busses x 32MB/sec per bus), the maximum transfer
rate from the CM to any UltraNet connected device
is 32MB/sec. Transfers at these rates are theoretically
possible between the CM and the Cray. Becker and
Dagum investigated this type of connection for
relatively small transfers [1].

Figure 5: Buffers per second transfer rate for SGI
between two machines and between two processes on
same machine. Two machines are about twice as fast
as local machine transfers for buffers smaller than 2.5
MB. Above 2.5 MB, two machine transfers are
significantly faster.

For applications involving workstations, the work- 
station I/O will generally determine the transfer rate. 
Figure 6 illustrates average throughput from CM to a 
VGX. In these tests, conducted using CM system soft- 
ware version 6.0, the SGI read 30 megabytes of data 
from the CM in one megabyte chunks. The time to 
byte-swap and convert from parallel CM representa- 
tion to serial representation (both done on the CM) 
is included in the timings. The timings show perfor- 
mance under 10-12MB/sec for all sizes of writes. The 
parallel to serial transpose is relatively slow, and limits 
overall performance. In release 6.1 of the CM System 
software, the efficiency of the parallel to serial trans- 
pose should be much better and throughput should 
therefore be closer to the SGI/VME maximum 10- 
12MB/sec. 

A transfer of a 32-bit floating point number from
each processor of one sequencer (8,192 processors) on
the CM contains a total of 32KB of data. This is
the minimum effective size the CM-HiPPI can transfer.
This limitation may affect performance for codes
which need to transfer a smaller amount of data.
The CMFS sockets package is used when transferring
data from each processor to the UltraNet. Currently,
CMFS sockets are only compatible with the fieldwise
model of the CM. Using the UltraNet from any slicewise
program will be highly inefficient until the software
directly supports the slicewise data layout. Further
investigation into CM-HiPPI and CM to UltraNet
performance is being postponed until the more
efficient parallel to serial transpose software arrives,
since the transpose currently dominates other factors
in determining CM transfer rates.

Figure 6: One Connection Machine sequencer writing
to SGI VGX over UltraNet. Mean, min and max of
10 tests at each size. VGX read buffer size is held
constant at one megabyte. Lowest rate is for 512 bytes
sent from each of 8,192 processors or a total of 4MB.
Highest rate is for 16k from each processor (128MB
total).
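
The buffer sizes quoted for one sequencer follow from multiplying the per-processor payload by 8,192 processors. The short sketch below reproduces that arithmetic; the function names are hypothetical, and sizes are in bytes with 1KB taken as 1,024 bytes.

```python
PROCESSORS_PER_SEQUENCER = 8192


def total_bytes(bytes_per_processor, processors=PROCESSORS_PER_SEQUENCER):
    """Size of one transfer when every processor contributes the same amount."""
    return bytes_per_processor * processors


def floats_per_processor(total, processors=PROCESSORS_PER_SEQUENCER, float_size=4):
    """Number of 32-bit values per processor needed to reach `total` bytes."""
    return total // (processors * float_size)
```

One 32-bit value per processor gives `total_bytes(4)`, which is 32KB, and the 512-byte and 16KB per-processor cases in Figure 6 give 4MB and 128MB respectively.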


6 Summary 

Taking advantage of the UltraNet for distributed 
applications is not a trivial task; however, it is be- 
coming easier. This paper has examined the develop- 
ment of one such application. This application demon- 
strated that development for a relatively new network 
and combination of computer architectures is made 
easier when it is done piece by piece. Comprehension 
of UltraNet behavior was gained by running simple 
benchmarks. The communications primitives used for 
these benchmarks were later modified to form the ba- 
sis for the application’s communication code. Later, 
single process prototypes were modified to run in the 
full networked, multiple process application. By iso- 
lating and prototyping, software could be developed 
even when hardware was unavailable.

The maximum throughput of the UltraNet depends
on the hardware attached. The SGI VGX can utilize
10-12MB/sec of UltraNet bandwidth. The hardware
of the SGI/VME interface performs as well as other
workstations’ interfaces but is limiting compared to
the speeds of HiPPI and supercomputer UltraNet
interfaces. Furthermore, the DMA of the SGI VGX may
limit throughput when handling buffer sizes larger
than 2.5MB on one machine, or 5MB between two
machines. It is easy to adapt current socket programs
to use UltraNet sockets, although to attain maximum
throughput, devoting a process to network handling is
a good idea.

The Connection Machine is also easily accessible 
to and from the UltraNet. The CM can take advan- 
tage of greater bandwidth to the UltraNet, because 
it is not limited like the workstations or the Convex 
with a VME interface. CM-HiPPI software is
well-designed but currently has several limitations; there
is no slicewise interface, a slow transpose is required, 
and language support is limited. Thinking Machines 
is working to remedy these problems. 

The current level of UltraNet performance is high
enough to support a variety of applications. For a 
given application, making good choices concerning 
how much data to transfer and when to transfer data 
will improve application throughput. It must be kept 
in mind that UltraNet and the related pieces are rel- 
atively new and immature. Undoubtedly, as UltraNet 
matures, software toolkits will incorporate features to 
make its use transparent to the programmer. 